Runtime/namespace/client wide worker heartbeat #983

yuandrew merged 21 commits into temporalio:worker-heartbeat
Conversation
cretz
left a comment
The PR is titled "process-wide" but it seems it's actually "runtime-wide". IMO, though, it should be at the client level: the client is where you have the connection to actually make the invocation, and it is the thing workers share to communicate with the server.
Sushisource
left a comment
As for the client/runtime thing, I think this is pretty preferential and an internal detail that won't matter much to users.
The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps). They are used when creating a worker, but the relationship is more like a worker "has" a client, and more than one worker may have the same client.
That's why I don't really want the heartbeating duration to be a client property. Right now our client options in the lang layer don't directly reference anything worker specific (they do indirectly with interceptors/plugins), and I think it should really stay that way. Specifying a heartbeating duration for a client that is never used with a worker, for example, is weird.
So, that means either this option stays on a Runtime (seems perfectly fine to me) or needs to be passed in when initting the worker, but then there's some last-write-wins problem if users use the same client with different values when passing to workers. So, more reason to put it on the Runtime.
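To make the last-write-wins concern concrete, here is a minimal std-only sketch with hypothetical `Client`/`Worker` stand-ins (none of these are the real SDK types): if worker init could set a heartbeat interval on a shared client, the last worker constructed would silently clobber earlier workers' settings.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Hypothetical stand-ins, only to illustrate the last-write-wins hazard.
struct Client {
    heartbeat_interval: Mutex<Option<Duration>>,
}

struct Worker {
    client: Arc<Client>,
}

impl Worker {
    // If worker init were allowed to write a heartbeat interval onto the
    // shared client, each construction overwrites the previous value.
    fn new(client: Arc<Client>, heartbeat: Option<Duration>) -> Self {
        *client.heartbeat_interval.lock().unwrap() = heartbeat;
        Worker { client }
    }
}

fn main() {
    let client = Arc::new(Client { heartbeat_interval: Mutex::new(None) });
    let _w1 = Worker::new(client.clone(), Some(Duration::from_secs(60)));
    // w2 disables heartbeating, clobbering w1's setting on the shared client.
    let _w2 = Worker::new(client.clone(), None);
    assert_eq!(*client.heartbeat_interval.lock().unwrap(), None);
}
```

This is the hazard of passing the value at worker-init time against a shared client; keeping the option on the Runtime sidesteps it.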
As for literally where this map lives or not - I don't have a huge preference either way. Changing the trait that was used to get slot suppliers for eager to be more generic & used for this I can see. I'm also fine with the current setup.
Right, they are associated with them. That's the point. Same for eager workflow start. A worker is associated with a single client, not some other client in the process/runtime.

@Sushisource - to confirm, you don't believe we need to make sure a worker's heartbeat occurs on the worker's client? I think that is much simpler to understand for those who want to understand which calls use which clients. This is important because users control the lifetimes of clients, handle explicit client replacement to update auth, and are the ones that control which clients are used for which workers. I still can't really see a reason not to use a worker's client to make calls relating to that worker (or the set of workers that share the client).
Sure, I think that's fair (though the actual scenario in the current design where that wouldn't happen seems exceptionally rare). I maintain that the config option shouldn't live in the client options, though.
Do you believe that I, as a user, need to make a whole new Tokio thread pool if I want to disable worker heartbeating only for some workers? Not a rhetorical question; I don't mind if the answer is "yes, that should be rare".
Technically you wouldn't even have to, right now (at least as this is exposed in Core - you might have to depending on how lang does it). You can make a new CoreRuntime without constructing a new Tokio runtime, re-using an existing one instead. So we can support that. But even if we don't, I'm not hugely concerned.
```rust
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
heartbeat_interval: Option<Duration>,
```
I still think it's unnecessary to put this on the runtime instead of the client, but it's not a big deal. It disallows hybrid users (e.g. users that connect to both self-hosted and Cloud) from disabling/adjusting heartbeats specifically for one environment; from a lang POV they'll have to create an entirely new thread pool to be able to do so. It seems like the only benefit to putting this value on the runtime instead of the client is "feels", IIUC.

Having said that, not a big deal, up to y'all.
Well, again, it doesn't mean a new threadpool. We established in my other comment you can re-use the same tokio runtime for different runtimes.
The problem for me is I don't want some weird worker setting showing up in the client crate for people who are only using the client.
One way to address that I suppose is to put that option behind a feature flag that is off by default, but enabled when core brings in the client dep. I'd be OK with that. But, IMO it's mostly a non-issue either way so up to @yuandrew if you want to change it.
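The feature-flag approach could look roughly like this (crate and feature names here are illustrative guesses, not the actual manifests):

```toml
# client crate's Cargo.toml (illustrative): the worker-only option is
# gated behind a feature that is off by default.
[features]
default = []
worker-settings = []
```

```toml
# core crate's Cargo.toml (illustrative): core always enables the gate
# when it brings in the client dependency.
[dependencies]
temporal-client = { version = "0.1", features = ["worker-settings"] }
```

Plain client users would then never see the worker setting unless they opt in.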
cretz
left a comment
All looks great to me. Only concern is enabling it for everyone by default.
```rust
telemetry_options: TelemetryOptions,
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
```
IMO we shouldn't turn on worker heartbeating by default at this time. We should get it into langs, make sure it works as we expect, has no adverse effects, etc., and then consider turning it on by default. Arguably I should be able to turn it off/on per client as mentioned at #983 (comment), but that's not a big deal.
This is going into a feature branch, so I'm turning it on for simplicity, but I agree we should not turn it on by default in main until everything is known to work as expected.
Yep, but, the goal is to have it on by default when released.
Immediately upon release? Meaning there's no period where we try it out as opt-in in lang first? When we did this before, users saw unintentional logs and such. I figure we should at least try it in langs before turning it on for everyone in langs.
Yes. Given the fact that this isn't really user-facing in any visible way ATM, I think the issue is basically zero people are going to flip this on.
Certainly though, @yuandrew , as part of our testing, let's make sure we're not having anything like that happen, including SDK and Server logs.
Sushisource
left a comment
Nice! I really like the way this is looking despite a bajillion comments. Mostly just polish stuff really. A few logic questions.
```diff
 r.expect_sdk_name_and_version()
     .returning(|| ("test-core".to_string(), "0.0.0".to_string()));
-r.expect_get_identity()
+r.expect_identity()
```
Bug: Mock Setup Errors in WorkerClient
The worker_set_key() method, recently added to the WorkerClient trait, has issues in its mock setups. In mock_worker_client(), the expect_worker_set_key() is incorrectly configured, passing a function pointer where a closure is expected. Additionally, mock_manual_worker_client() is missing this expectation entirely, which will cause a runtime panic if the method is called.
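To illustrate the missing-expectation failure mode with a hand-rolled stand-in (this is not mockall's actual machinery, just a sketch of the pattern): a mocked method panics at runtime when no expectation was configured, and the expectation-setter accepts any closure, including capturing ones that a bare fn pointer could not express.

```rust
// Minimal hand-rolled mock, illustrating why a mock missing an
// expectation for worker_set_key() panics when the method is called.
struct MockWorkerClient {
    worker_set_key: Option<Box<dyn Fn() -> String>>,
}

impl MockWorkerClient {
    fn new() -> Self {
        Self { worker_set_key: None }
    }

    // Accepts any closure, including ones that capture state.
    fn expect_worker_set_key(&mut self, f: impl Fn() -> String + 'static) {
        self.worker_set_key = Some(Box::new(f));
    }

    fn worker_set_key(&self) -> String {
        match &self.worker_set_key {
            Some(f) => f(),
            None => panic!("no expectation set for worker_set_key"),
        }
    }
}

fn main() {
    let mut m = MockWorkerClient::new();
    let key = String::from("ns|q");
    // A capturing closure works here; a plain fn pointer could not capture `key`.
    m.expect_worker_set_key(move || key.clone());
    assert_eq!(m.worker_set_key(), "ns|q");
}
```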
```rust
self.all_workers
    .insert(worker.worker_instance_key(), worker);

Ok(())
```
Bug: Worker Registration Inconsistency Causes Heartbeat Issues
The ClientWorkerSetImpl::register method can leave the registry in an inconsistent state: slot_providers is updated before the worker is fully registered in all_workers. If a later step, like heartbeat setup, fails, try_reserve_wft_slot may attempt to use a non-existent worker, and the stale entry prevents re-registration for that namespace/task queue, permanently losing the worker's heartbeat callback.
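One way to avoid the inconsistency, sketched on a simplified hypothetical registry (not the real ClientWorkerSetImpl): run every fallible step before publishing the worker into either map, so a failure leaves nothing to roll back.

```rust
use std::collections::HashMap;

// Simplified, hypothetical model of the registry; the real type also
// holds slot providers and heartbeat state.
#[derive(Default)]
struct Registry {
    all_workers: HashMap<String, String>,    // instance key -> worker
    slot_providers: HashMap<String, String>, // namespace+queue -> instance key
}

impl Registry {
    // Do all fallible work (e.g. heartbeat setup) before touching either
    // map, so a failure leaves no partial registration behind.
    fn register(
        &mut self,
        instance_key: &str,
        ns_queue: &str,
        heartbeat_setup: impl Fn() -> Result<(), String>,
    ) -> Result<(), String> {
        if self.slot_providers.contains_key(ns_queue) {
            return Err(format!("worker already registered for {ns_queue}"));
        }
        heartbeat_setup()?; // fails before any state is published
        self.all_workers
            .insert(instance_key.to_string(), "worker".to_string());
        self.slot_providers
            .insert(ns_queue.to_string(), instance_key.to_string());
        Ok(())
    }
}

fn main() {
    let mut r = Registry::default();
    // A failing heartbeat setup leaves both maps empty...
    assert!(r.register("k1", "ns/q", || Err("boom".to_string())).is_err());
    assert!(r.all_workers.is_empty() && r.slot_providers.is_empty());
    // ...so re-registration for the same namespace/task queue still succeeds.
    assert!(r.register("k1", "ns/q", || Ok(())).is_ok());
}
```

The key design choice is ordering: validation and fallible setup first, map insertions last, so no rollback path is needed.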
Sushisource
left a comment
Looks like we're good to go here, I think. Just a few small things; fix these lints and we're ready! Nice
```diff
 #[cfg(test)]
 fn num_providers(&self) -> (usize, usize) {
-    (self.index.len(), self.providers.len())
+    (self.slot_providers.len(), self.slot_providers.len())
```
No reason to return the same number twice here, the few places tests use this can just get updated.
```rust
/// For slot managing, there can only be one worker registered per
/// namespace+queue_name+client, others will get ignored.
```
Should be changed to say this is an error now
```rust
trace_subscriber: Option<Arc<dyn Subscriber + Send + Sync>>,
}

struct WorkerHeartbeatManager {
```
nit: Move this down to live next to the registrator
…nt level (#1038)

* Runtime/namespace/client wide worker heartbeat (#983)
* worker heartbeat
* Address Spencer's comments
* wip use client_identity_override as part of key, added test
* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker
* Verified client replacement works, need to update tests and cleanup
* formating
* clean up
* forgot to remove new() now that using builder pattern
* Switch to worker_set_key
* Replace client test passes, need to write unit tests in worker_registry
* cargo test-lint
* limit nexus to 1 poller, add tests for worker_registry for heartbeat
* PR comments
* new test helper
* Return error on multi worker register for same namespace and task queue on same client
* cargo fmt
* Fix registration order, unique task queue for test worker
* Remove TEST_Q variable
* Missing quotes
* CI lint and docker test fix, rename worker_set_key to worker_grouping_key
* clippy bug
* Worker heartbeat: New in-memory metrics mechism, plumb rest of heartbeat data (#1023)
* plumb in memory metrics
* simplify worker::new(), fix some heartbeat metrics, new test file
* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()
* Support in-mem metrics when metrics aren't configured
* Move sys_info refresh to dedicated thread, use tuner's existing sys info
* Format, AtomicCell
* Fix unit test
* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added
* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test
* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement
* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests
* More tests, sticky cache miss, plugins
* Formatting, fix skip_client_worker_set_check
* Cursor found a bug
* Lower sleep time, add print for debugging
* more prints
* use semaphores for worker_heartbeat_failure_metrics
* skip_client_worker_set_check for all integ workers
* Can't use tokio semaphore in workflow code
* use signal to test workflow_slots.last_interval_failure_tasks
* Use Notify instead of semaphores, fix test flake
* Use eventually() instead of a manual sleep
* max_outstanding_workflow_tasks 2
* merge
* Forgot to commit format fixes
* Fix test
NOTE: this targets a `worker-heartbeat` feature branch to merge into; that way I can make incremental progress and not hit main until the whole feature is ready for both server and SDK.

What was changed

Worker heartbeat duration is a setting configured with `RuntimeOptions`, but internally the heartbeat mechanism lives on the client. This way we can properly replace the client the `SharedNamespaceWorker` uses when replacing the client of a regular worker.

A follow-up PR will address filling out the rest of the heartbeat data; most of the remaining pieces require an implementation of storing metrics in memory so we can pull that data on each heartbeat.
Why?
Worker heartbeat. I separated WorkerHeartbeat out of #962. Some of the design for this is with Worker Commands in mind, but that will come at a later time.
Checklist
Closes
How was this tested:
- fixed up `worker_heartbeat` unit test
- verified with prints (printed SDK name and version) that replace_client works with heartbeating, but couldn't come up with a good unit/integration test to verify this behavior